472 research outputs found

    Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias

    As a popular modeling approach for end-to-end speech recognition, attention-based encoder-decoder models are known to suffer from length bias and the corresponding beam problem. Various heuristics have been applied within simple beam search to ease the problem, most of which require considerable tuning. We show that such heuristics are not a proper modeling refinement and lead to severe performance degradation as the beam size grows. We propose a novel beam search derived by reinterpreting the sequence posterior with explicit length modeling. Applying the reinterpreted probability together with beam pruning yields a robust model modification that allows reliable comparison among output sequences of different lengths. Experimental verification on the LibriSpeech corpus shows that the proposed approach solves the length bias problem without heuristics or additional tuning effort. It provides robust decision making and consistently good performance under both small and very large beam sizes. Compared with the best results of the heuristic baseline, the proposed approach achieves the same WER on the 'clean' sets and a 4% relative improvement on the 'other' sets. We also show that it is more efficient when combined with an additionally derived early stopping criterion.
    Comment: accepted at INTERSPEECH202
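
    To make the scoring idea concrete, below is a minimal, hypothetical sketch of beam search with an explicit length term, written in Python. It is not the authors' algorithm; it only illustrates the general idea of comparing finished hypotheses of different lengths via a length model instead of ad-hoc heuristics. All names (step_fn, length_log_prob, the toy model) are invented for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (not from the paper): a step function returning
# per-token log-probabilities given a prefix, and an explicit length model.
StepFn = Callable[[Tuple[str, ...]], Dict[str, float]]
LenFn = Callable[[int], float]

EOS = "</s>"

def beam_search_with_length_model(step_fn: StepFn,
                                  length_log_prob: LenFn,
                                  beam_size: int = 8,
                                  max_len: int = 20) -> Tuple[Tuple[str, ...], float]:
    """Sketch: finished hypotheses are compared using their label log-probability
    plus an explicit length-model term, rather than a length-normalization heuristic."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]  # (prefix, label score)
    finished: List[Tuple[Tuple[str, ...], float]] = []

    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                new_prefix, new_score = prefix + (token,), score + logp
                if token == EOS:
                    # Add the explicit length term only for final comparison.
                    finished.append((new_prefix, new_score + length_log_prob(len(new_prefix))))
                else:
                    candidates.append((new_prefix, new_score))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]

    return max(finished, key=lambda x: x[1]) if finished else max(beams, key=lambda x: x[1])

# Toy usage: a degenerate "model" that always prefers "a" over ending the sequence.
toy = lambda prefix: {"a": math.log(0.6), EOS: math.log(0.4)}
print(beam_search_with_length_model(toy, lambda n: -0.1 * n))
```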

    Language Modeling with Deep Transformers

    We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well-configured Transformer models outperform our baseline models based on a shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K-vocabulary word-level and 10K byte-pair-encoding (BPE) subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention-based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. Positional encoding is an essential augmentation for the self-attention mechanism, which is otherwise invariant to sequence ordering. However, in the autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is itself a positional signal. An analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of such positional information. We find that removing the positional encoding even slightly improves the performance of these models.
    Comment: To appear in the proceedings of INTERSPEECH 201
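
    As a rough illustration of the second point, the sketch below (PyTorch, not the authors' code) builds a causal, decoder-only Transformer language model with no positional encoding at all: token embeddings feed directly into masked self-attention layers. The layer sizes, names, and hyperparameters are arbitrary assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class NoPosTransformerLM(nn.Module):
    """Minimal causal Transformer language model WITHOUT positional encoding:
    token embeddings go straight into masked self-attention layers."""

    def __init__(self, vocab_size: int, d_model: int = 512,
                 nhead: int = 8, num_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, time) integer ids; note: no positional encoding is added.
        seq_len = tokens.size(1)
        # Additive causal mask: -inf above the diagonal blocks attention to the future.
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.layers(self.embed(tokens), mask=causal_mask)
        return self.out(hidden)  # (batch, time, vocab) next-token logits

# Toy forward pass.
lm = NoPosTransformerLM(vocab_size=1000)
logits = lm(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```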

    Context-Dependent Acoustic Modeling without Explicit Phone Clustering

    Phoneme-based acoustic modeling for large-vocabulary automatic speech recognition takes advantage of phoneme context. The large number of context-dependent (CD) phonemes and their highly varying statistics require tying or smoothing to enable robust training. Usually, Classification and Regression Trees are used for phonetic clustering, which is standard in Hidden Markov Model (HMM)-based systems. However, this solution introduces a secondary training objective and does not allow for end-to-end training. In this work, we address direct phonetic context modeling for the hybrid Deep Neural Network (DNN)/HMM, which does not rely on any phone clustering algorithm to determine the HMM state inventory. By performing different decompositions of the joint probability of the center phoneme state and its left and right contexts, we obtain a factorized network consisting of different components that are trained jointly. Moreover, the representation of the phonetic context for the network relies on phoneme embeddings. On the Switchboard task, the recognition accuracy of our proposed models is comparable to, and slightly better than, that of the hybrid model using standard state-tying decision trees.
    Comment: Submitted to Interspeech 202
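
    One possible decomposition of the kind described above is a plain chain-rule factorization; the sketch below is illustrative only and not necessarily the exact factorization used in the paper (symbols c, l, r, x are chosen here for the center state, left context, right context, and acoustics).

```latex
% Illustrative chain-rule decomposition of the joint probability of the center
% phoneme state c, left context l, and right context r given the acoustics x;
% each factor can be realized by a separate, jointly trained network component.
p(c, l, r \mid x) = p(c \mid x)\, p(l \mid c, x)\, p(r \mid l, c, x)
```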

    Improved training of end-to-end attention models for speech recognition

    Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme that starts with a high time-reduction factor and lowers it during training, which is crucial both for convergence and for final performance. In some experiments, we also use an auxiliary CTC loss function to aid convergence. In addition, we train long short-term memory (LSTM) language models on subword units. With shallow fusion, we report up to 27% relative improvement in WER over the attention baseline without a language model.
    Comment: submitted to Interspeech 201
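
    Shallow fusion itself amounts to a log-linear combination of decoder and language-model scores at decoding time. The snippet below is a generic, hypothetical sketch of that combination; the interpolation weight and all function names are assumptions, not values or code from the paper.

```python
import math
from typing import Dict

def shallow_fusion_scores(am_log_probs: Dict[str, float],
                          lm_log_probs: Dict[str, float],
                          lm_weight: float = 0.3) -> Dict[str, float]:
    """Combine per-token log-probabilities from the attention decoder and an
    external LM as log P_decoder + lambda * log P_LM. The weight 0.3 is an
    arbitrary placeholder; in practice it is tuned on development data."""
    return {tok: am_log_probs[tok] + lm_weight * lm_log_probs.get(tok, -math.inf)
            for tok in am_log_probs}

# Toy example with a two-subword vocabulary.
am = {"he@@": math.log(0.7), "she@@": math.log(0.3)}
lm = {"he@@": math.log(0.2), "she@@": math.log(0.8)}
scores = shallow_fusion_scores(am, lm)
print(max(scores, key=scores.get))  # token chosen after fusing decoder and LM scores
```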

    ΠœΠ°Ρ‚Π΅Ρ€ΠΈΠ°Π»ΡŒΠ½Π°Ρ ΠΈ духовная ΠΊΡƒΠ»ΡŒΡ‚ΡƒΡ€Π° армян ΠšΠ°Ρ€Π°Π±Π°Ρ…Π°: ΠΏΡ€ΠΎΠ±Π»Π΅ΠΌΡ‹ развития ΠΈ сохранСния Π½Π°Ρ†ΠΈΠΎΠ½Π°Π»ΡŒΠ½ΠΎΠ³ΠΎ ΠΊΡƒΠ»ΡŒΡ‚ΡƒΡ€Π½ΠΎΠ³ΠΎ наслСдия Π² 1920-1990-Ρ… Π³Π³.

    Π Π°ΡΡΠΌΠ°Ρ‚Ρ€ΠΈΠ²Π°ΡŽΡ‚ΡΡ вопросы ΠΌΠ°Ρ‚Π΅Ρ€ΠΈΠ°Π»ΡŒΠ½ΠΎΠΉ ΠΈ Π΄ΡƒΡ…ΠΎΠ²Π½ΠΎΠΉ ΠΊΡƒΠ»ΡŒΡ‚ΡƒΡ€Ρ‹ армян ΠšΠ°Ρ€Π°Π±Π°Ρ…Π°, Π° Ρ‚Π°ΠΊΠΆΠ΅ ΠΏΡ€ΠΎΠ±Π»Π΅ΠΌΡ‹ развития ΠΈ сохранСния ΠΈΡ… Π½Π°Ρ†ΠΈΠΎΠ½Π°Π»ΡŒΠ½ΠΎΠ³ΠΎ ΠΊΡƒΠ»ΡŒΡ‚ΡƒΡ€Π½ΠΎΠ³ΠΎ наслСдия Π² 1920-1990-Ρ… Π³Π³

    On a Method for Monitoring the Operability of a Shift Register

    ΠžΠΏΠΈΡΡ‹Π²Π°Π΅Ρ‚ΡΡ Ρ„ΡƒΠ½ΠΊΡ†ΠΈΠΎΠ½Π°Π»ΡŒΠ½Π°Ρ схСма устройства контроля работоспособности ΡΠ΄Π²ΠΈΠ³Π°ΡŽΡ‰Π΅Π³ΠΎ рСгистра, основанного Π½Π° ΠΌΠ΅Ρ‚ΠΎΠ΄Π΅ ΡƒΡ‡Π΅Ρ‚Π° Π²Ρ€Π΅ΠΌΠ΅Π½ΠΈ сдвига Π΅Π΄ΠΈΠ½ΠΈΡ†Ρ‹ Ρ‡Π΅Ρ€Π΅Π· рСгистр. Π’ устройствС ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΡƒΠ΅Ρ‚ΡΡ Π΄Π²Π΅ Π΄Π²ΡƒΡ…Π²Ρ…ΠΎΠ΄ΠΎΠ²Ρ‹Π΅ схСмы совпадСния, линия Π·Π°Π΄Π΅Ρ€ΠΆΠΊΠΈ